As well as driving the Wikipedia website itself, many of the facts recorded in Wikipedia are also exposed through DBPedia, a special sort of database known as a Linked Data database.
So for example, on Wikipedia we have The Charlotte, and on DBPedia we also have The Charlotte, albeit in a form that's more intended for machine consumption.
With some help from code, and a query language called SPARQL, we can write queries over the facts known to Wikipedia...
...like a list of venues, with locations...
...because this is what does the work, and by writing code here we can reuse it and write less code elsewhere...
%%capture
#Install some essential packages
%pip install SPARQLWrapper pandas folium
# Import the necessary packages
from SPARQLWrapper import SPARQLWrapper, JSON
# Add some helper functions
# A function that will return the results of running a SPARQL query with
# a defined set of prefixes over a specified endpoint.
# It follows the same five-step process apart from creating the query, which
# is provided as an argument to the function.
def runQuery(endpoint, prefix, q):
    ''' Run a SPARQL query with a declared prefix over a specified endpoint '''
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(prefix + q)  # concatenate the strings representing the prefixes and the query
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()
# Import pandas to provide facilities for creating a DataFrame to hold results
import pandas as pd
# Function to convert query results into a DataFrame
# The results are assumed to be in JSON format and therefore the Python dictionary will have
# the results indexed by 'results' and then 'bindings'.
def dict2df(results):
    ''' A function to flatten the SPARQL query results and return the column values '''
    data = []
    for result in results["results"]["bindings"]:
        tmp = {}
        for el in result:
            tmp[el] = result[el]['value']
        data.append(tmp)
    df = pd.DataFrame(data)
    return df
# Function to run a query and return results in a DataFrame
def dfResults(endpoint, prefix, q):
    ''' Generate a data frame containing the results of running
        a SPARQL query with a declared prefix over a specified endpoint '''
    return dict2df(runQuery(endpoint, prefix, q))
# Print a limited number of results of a query
def printQuery(results, limit=None):
    ''' Print the results from the SPARQL query '''
    resdata = results["results"]["bindings"]
    if limit:
        resdata = resdata[:limit]
    for result in resdata:
        for ans in result:
            print('{0}: {1}'.format(ans, result[ans]['value']))
        print()
# Run a query and print out a limited number of results
def printRunQuery(endpoint, prefix, q, limit=None):
    ''' Run a SPARQL query and print a limited number of results '''
    results = runQuery(endpoint, prefix, q)
    printQuery(results, limit)
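To see what the `dict2df()` helper is actually doing, here's a minimal sketch using a hand-made results dictionary in the same shape as the JSON that SPARQLWrapper hands back (the venue name and coordinates here are invented for illustration):

```python
import pandas as pd

# A hand-made stand-in (values invented) for the JSON structure that
# SPARQLWrapper returns: results are indexed by 'results' and then
# 'bindings', and each binding maps a query variable to a dict whose
# 'value' key holds the actual result.
fake_results = {
    "results": {
        "bindings": [
            {"venue_name": {"type": "literal", "value": "Example Hall"},
             "lat": {"type": "typed-literal", "value": "51.5"},
             "lon": {"type": "typed-literal", "value": "-0.1"}}
        ]
    }
}

# Flatten each binding into a plain {variable: value} dict...
data = [{var: binding[var]['value'] for var in binding}
        for binding in fake_results["results"]["bindings"]]

# ...and let pandas turn the list of dicts into a DataFrame
demo_df = pd.DataFrame(data)
print(demo_df)
```

Calling `dict2df(fake_results)` would produce the same one-row DataFrame.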
The query language is full of long URIs and can get hard to read, so we declare prefixes that act as aliases and cut down the clutter in our queries...
# Define any prefixes
prefix = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX ouseful: <http://ouseful.info/>
'''
The endpoint is where machines go to ask questions...
# Declare the DBPedia endpoint
endpoint = "http://dbpedia.org/sparql"
Now let's phrase a question...
DBPedia, can you give me the names and geo-coordinates of venues in England?
q = '''
SELECT DISTINCT ?venue_name ?lat ?lon
WHERE {
?venue rdfs:label ?venue_name .
?venue geo:lat ?lat .
?venue geo:long ?lon .
?venue dct:subject ?is_en_venue .
?is_en_venue skos:broader dbc:Music_venues_in_England .
FILTER (langMatches(lang(?venue_name), "en"))
} LIMIT 1000
'''
Now we actually pose the question and get a response back...
df = dfResults(endpoint, prefix, q)
df
| | venue_name | lat | lon |
|---|---|---|---|
| 0 | The Coronet | 51.4948 | -0.0989 |
| 1 | Pigalle Club | 51.5095 | -0.135 |
| 2 | The Spitz | 51.5197 | -0.0747222 |
| 3 | First Direct Arena | 53.8031 | -1.54222 |
| 4 | First Direct Arena | 53.8031 | -1.54222 |
| ... | ... | ... | ... |
| 259 | The Ram Folk Club | 51.3823 | -0.3414 |
| 260 | Worthing Leisure Centre | 50.8167 | -0.408758 |
| 261 | Workington Opera House | 54.6438 | -3.5443 |
| 262 | Bradford Odeon | 53.7925 | -1.7565 |
| 263 | Guildford Civic Hall | 51.2386 | -0.5663 |

264 rows × 3 columns
Whenever you work with data, you need to tidy it up. Here, we make sure the co-ordinates are treated as numbers and get rid of any rows that contain missing values.
df['lat'] = df['lat'].astype(float)
df['lon'] = df['lon'].astype(float)
# Drop any rows (axis=0) that contain missing values
df = df.dropna(how='any', axis=0)
#Preview the data
df
| | venue_name | lat | lon |
|---|---|---|---|
| 0 | The Coronet | 51.4948 | -0.098900 |
| 1 | Pigalle Club | 51.5095 | -0.135000 |
| 2 | The Spitz | 51.5197 | -0.074722 |
| 3 | First Direct Arena | 53.8031 | -1.542220 |
| 4 | First Direct Arena | 53.8031 | -1.542220 |
| ... | ... | ... | ... |
| 259 | The Ram Folk Club | 51.3823 | -0.341400 |
| 260 | Worthing Leisure Centre | 50.8167 | -0.408758 |
| 261 | Workington Opera House | 54.6438 | -3.544300 |
| 262 | Bradford Odeon | 53.7925 | -1.756500 |
| 263 | Guildford Civic Hall | 51.2386 | -0.566300 |

264 rows × 3 columns
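As a sketch of the same tidy-up on a toy DataFrame (the values are invented): `pd.to_numeric` with `errors='coerce'` is a slightly more forgiving alternative to `astype(float)`, turning anything unparseable into a missing value, which `dropna` then removes row by row:

```python
import pandas as pd

# Toy data (made up) with one unusable coordinate
toy = pd.DataFrame({
    'venue_name': ['A Hall', 'B Club', 'C Arena'],
    'lat': ['51.5', '52.1', None],
    'lon': ['-0.1', '1.3', '-2.2'],
})

# Coerce the coordinate columns to numbers; anything
# unparseable (or missing) becomes NaN...
toy['lat'] = pd.to_numeric(toy['lat'], errors='coerce')
toy['lon'] = pd.to_numeric(toy['lon'], errors='coerce')

# ...then drop any rows containing missing values
toy = toy.dropna(how='any')
print(toy)
```

The result keeps only the two fully-populated rows.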
Maps are actually quite easy to work with in code... Often it takes just a couple of lines to pull everything together, and fewer still if we use some magic (but more on that later...).
#folium is a package for doing stuff with maps
import folium
For each row of our dataset, plot a corresponding marker on a map...
m = folium.Map(location=[55, 0], zoom_start=5)
for row in df.itertuples():
    folium.Marker([row.lat, row.lon], popup=row.venue_name).add_to(m)
m
We can also save the map to an HTML file that we can share around, pop onto websites, etc..
m.save('venues.html')